Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Batch vote metrics (BFT-486) #157

Merged
merged 7 commits into from
Jul 24, 2024
Merged

feat: Batch vote metrics (BFT-486) #157

merged 7 commits into from
Jul 24, 2024

Conversation

aakoshh
Copy link
Contributor

@aakoshh aakoshh commented Jul 23, 2024

What ❔

Adds metrics around the gossiping of batch votes and the storage of batches.

Added gossip::metrics::BATCH_VOTES_METRICS:

  • network_gossip_batch_votes_committee_size to show how many attesters we have, which can be correlated with the number of votes we receive over time. Wanted to add a weight based one as well, but not sure if counting the weights monotonically would not overflow the counter.
  • network_gossip_batch_votes_votes_added counts the votes received since start; with its rate-of-change we can get an idea of how many attesters are producing votes
  • network_gossip_batch_votes_weight_added counts the total weight of votes added since start, each normalized to [0; 1] range with the total committee weight; this way it shouldn't overflow, and its rate of growth should be 0.8-1.0 per batch.
  • network_gossip_batch_votes_min_batch_number shows the minimum batch we expect votes for; going up will indicate that QCs are formed
  • network_gossip_batch_votes_last_added_vote_batch_number is the number from the last vote we registered; apart from Byzantine votes it should average to go up as L1 batches are created and votes are received, even if a QC cannot be formed
  • network_gossip_batch_votes_last_signed_batch_number indicates that the current node is publishing its votes and keeping up with the L1 batches

Also implement metrics for the BatchStore similar to how the BlockStore has them:

  • next for queued and persisted
  • latencies for the PersistedBatchStore operations
  • latencies for the wait_for_... methods
  • zksync_consensus_storage_batch_store_last_persisted_batch_qc: last batch QC saved

The latencies for some of the BatchStore methods could potentially be less cumbersome if we used instrumentation.

Why ❔

To make the voting process observable.

@aakoshh aakoshh force-pushed the bft-486-batch-metrics branch from 928cf60 to 8a44e6d Compare July 23, 2024 14:10
@aakoshh aakoshh marked this pull request as ready for review July 23, 2024 15:06
@aakoshh aakoshh requested a review from pompon0 as a code owner July 23, 2024 15:06
pub(crate) last_added_vote_batch_number: vise::Gauge<u64>,

/// Batch number of the last batch signed by this attester.
pub(crate) last_signed_batch_number: vise::Gauge<u64>,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can postpone this, but can we have a histogram that would show the distribution of votes delay after an l1 batch was produced?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds interesting, I will create a ticket for this. There are a few challenges:

  • the timestamp of the batch isn't available here
  • currently we can receive votes earlier than we have the batch itself, so we would have to store a timestamp with the vote, and the retrospectively emit negative latencies when the batch is produced on the local node

We can have some specific workarounds:

  • For example when the last_signed_batch_number above changes it can act as an approximation of when the L1 batch was produced, but it's only an approximation because it depends on the polling frequency, and also it's only available on nodes which are attesters.
  • We could observe the time elapsed between the first and last vote. For attester nodes this would indicate the speed of propagation of votes over gossip, compared to the vote produced locally, although again the local vote might come later than the first gossiped one.
  • We could implement Grzegorz's idea about only gossiping votes to nodes which have indicated they have the batch already, which would deal with the negative values; this simplifies the metrics in that we don't have to buffer timestamps and observe them the first time we learn about the batch.
  • We could observe the difference between l1_batches.created_at and l1_batches_consensus.created_at when we insert the QC, which would be the overall latency of gossiping votes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aakoshh aakoshh merged commit 1075fc6 into main Jul 24, 2024
5 checks passed
@aakoshh aakoshh deleted the bft-486-batch-metrics branch July 24, 2024 08:48
@aakoshh
Copy link
Contributor Author

aakoshh commented Jul 25, 2024

Added the following metrics to the Consensus Dashboard on Grafana:

image

The queries behind them are the following:

  • max (network_gossip_batch_votes_last_signed_batch_number) by (pod) != 0: last vote published by attesters
  • max (zksync_consensus_storage_batch_store_last_persisted_batch_qc) by (pod): last batch QC saved by nodes
  • max (network_gossip_batch_votes_last_added_vote_batch_number) by (pod): last vote received and added by nodes
  • min (network_gossip_batch_votes_min_batch_number) by (pod): minimum accepted vote by node
  • max(rate(network_gossip_batch_votes_votes_added[10m]) * 600) by (pod): number of votes added per minute over the last 10 minutes
  • max(network_gossip_batch_votes_committee_size) by (pod): committee size to compare against number of votes added

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants